ARC-AGI Evaluation Summary
This report summarizes the performance of five submissions evaluated on the 28_15x15_evaluation set, which contains 28 tasks.
Overall Results
| Submission | Total Score | # Tasks | Percentage | Mean Pixel Correct | Median Pixel Correct | Wilcoxon p-value (vs. 4o) |
|---|
| submission_4o.json | 8.50 | 28 | 30.36% | 71.51% | 75.00% | Reference |
| submission_4omini.json | 3.50 | 28 | 12.50% | 61.47% | 66.67% | 0.1220 |
| submission_agentswtool.json | 7.50 | 28 | 26.79% | 63.62% | 62.50% | 0.3890 |
| submission_finetune4o.json | 16.00 | 28 | 57.14% | 86.88% | 100.00% | 0.0064 |
| submission_finetune4omini.json | 10.50 | 28 | 37.50% | 73.62% | 80.00% | 0.7878 |
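
The submission_4o.json row above can be reproduced from that submission's per-pair rows in this report under one reading of the columns: a test pair counts as solved only at a Pixel % of exactly 100, a task earns the fraction of its pairs that are solved, and the mean and median are taken over all test pairs rather than over tasks. The sketch below is an inference from the reported numbers, not the evaluator's code; `PAIRS_4O`, `task_credit` and `summarize` are hypothetical names.

```python
from statistics import mean, median

# Pixel % per test pair for submission_4o.json, copied from this report.
PAIRS_4O = {
    "00576224": [100.00], "17cae0c1": [66.67], "2072aba6": [72.22],
    "27a77e38": [98.77], "31d5ba1a": [46.67, 40.00], "34b99a2b": [75.00],
    "4cd1b7b2": [100.00], "59341089": [0.00], "62b74c02": [0.00],
    "66e6c45b": [100.00], "66f2d22f": [64.29], "68b67ca3": [100.00],
    "6ea4a07e": [100.00, 100.00], "72207abc": [100.00], "8ba14f53": [88.89],
    "a8610ef7": [63.89], "aa18de87": [97.22], "b1fc8b8e": [0.00, 0.00],
    "bbb1b8b6": [100.00, 62.50], "be03b35f": [75.00], "ca8de6ea": [100.00],
    "d017b73f": [58.33], "e133d23d": [88.89], "e345f17b": [75.00, 50.00],
    "e633a9e5": [100.00], "ed74f2f2": [88.89], "ed98d772": [55.56],
    "fc754716": [92.06],
}


def task_credit(pair_pcts):
    """Fraction of a task's test pairs matched exactly (assumed: Pixel % == 100)."""
    return sum(pct == 100.0 for pct in pair_pcts) / len(pair_pcts)


def summarize(tasks):
    """Aggregate per-pair pixel percentages into the overall-table columns."""
    all_pairs = [pct for pcts in tasks.values() for pct in pcts]
    total = sum(task_credit(pcts) for pcts in tasks.values())
    return {
        "total_score": total,
        "percentage": round(100 * total / len(tasks), 2),
        "mean_pixel": round(mean(all_pairs), 2),      # mean over test pairs
        "median_pixel": round(median(all_pairs), 2),  # median over test pairs
    }


summary = summarize(PAIRS_4O)
print(summary)
# {'total_score': 8.5, 'percentage': 30.36, 'mean_pixel': 71.51, 'median_pixel': 75.0}
```

Under this reading, a task with two test pairs contributes 0.5 when only one pair is matched exactly, which is why the Total Score column contains half points.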
Submission: submission_4o.json
Task: 00576224
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 36 | 100.00% |  |
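
In the per-task tables, Pixel % appears to be Correct Pixels divided by the number of cells in the expected output grid (for example, 36 correct pixels at 100.00% implies a 36-cell output). A minimal sketch of such a cell-by-cell comparison follows; the function name, the toy grids, and the choice to score a shape-mismatched prediction as zero are assumptions for illustration, not the evaluator's actual code.

```python
def pixel_accuracy(predicted, expected):
    """Return (correct_cells, percent) for two grids given as lists of rows."""
    total = sum(len(row) for row in expected)
    same_shape = len(predicted) == len(expected) and all(
        len(p_row) == len(e_row) for p_row, e_row in zip(predicted, expected)
    )
    if not same_shape or total == 0:
        # Assumption: a prediction with the wrong shape earns no pixel credit.
        return 0, 0.0
    correct = sum(
        p == e
        for p_row, e_row in zip(predicted, expected)
        for p, e in zip(p_row, e_row)
    )
    return correct, 100.0 * correct / total


# Toy 3x3 grids (not taken from any ARC task): one cell differs.
expected = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]
predicted = [[1, 0, 1], [0, 1, 0], [1, 1, 1]]
correct, pct = pixel_accuracy(predicted, expected)
print(correct, f"{pct:.2f}%")  # 8 88.89%
```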
Task: 17cae0c1
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 18 | 66.67% |  |
Task: 2072aba6
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 26 | 72.22% |  |
Task: 27a77e38
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 80 | 98.77% |  |
Task: 31d5ba1a
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 7 | 46.67% |  |
| 1 | 6 | 40.00% |  |
Task: 34b99a2b
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 15 | 75.00% |  |
Task: 4cd1b7b2
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 16 | 100.00% |  |
Task: 59341089
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 0 | 0.00% |  |
Task: 62b74c02
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 0 | 0.00% |  |
Task: 66e6c45b
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 16 | 100.00% |  |
Task: 66f2d22f
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 18 | 64.29% |  |
Task: 68b67ca3
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 9 | 100.00% |  |
Task: 6ea4a07e
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 9 | 100.00% |  |
| 1 | 9 | 100.00% |  |
Task: 72207abc
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 87 | 100.00% |  |
Task: 8ba14f53
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 8 | 88.89% |  |
Task: a8610ef7
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 23 | 63.89% |  |
Task: aa18de87
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 35 | 97.22% |  |
Task: b1fc8b8e
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 0 | 0.00% |  |
| 1 | 0 | 0.00% |  |
Task: bbb1b8b6
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 16 | 100.00% |  |
| 1 | 10 | 62.50% |  |
Task: be03b35f
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 3 | 75.00% |  |
Task: ca8de6ea
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 9 | 100.00% |  |
Task: d017b73f
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 14 | 58.33% |  |
Task: e133d23d
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 8 | 88.89% |  |
Task: e345f17b
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 12 | 75.00% |  |
| 1 | 8 | 50.00% |  |
Task: e633a9e5
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 25 | 100.00% |  |
Task: ed74f2f2
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 8 | 88.89% |  |
Task: ed98d772
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 20 | 55.56% |  |
Task: fc754716
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 58 | 92.06% |  |
Submission: submission_4omini.json
Task: 00576224
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 36 | 100.00% |  |
Task: 17cae0c1
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 3 | 11.11% |  |
Task: 2072aba6
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 28 | 77.78% |  |
Task: 27a77e38
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 79 | 97.53% |  |
Task: 31d5ba1a
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 8 | 53.33% |  |
| 1 | 10 | 66.67% |  |
Task: 34b99a2b
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 9 | 45.00% |  |
Task: 4cd1b7b2
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 4 | 25.00% |  |
Task: 59341089
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 18 | 50.00% |  |
Task: 62b74c02
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 0 | 0.00% |  |
Task: 66e6c45b
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 16 | 100.00% |  |
Task: 66f2d22f
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 19 | 67.86% |  |
Task: 68b67ca3
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 8 | 88.89% |  |
Task: 6ea4a07e
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 0 | 0.00% |  |
| 1 | 4 | 44.44% |  |
Task: 72207abc
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 86 | 98.85% |  |
Task: 8ba14f53
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 6 | 66.67% |  |
Task: a8610ef7
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 24 | 66.67% |  |
Task: aa18de87
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 34 | 94.44% |  |
Task: b1fc8b8e
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 25 | 100.00% |  |
| 1 | 18 | 72.00% |  |
Task: bbb1b8b6
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 12 | 75.00% |  |
| 1 | 9 | 56.25% |  |
Task: be03b35f
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 2 | 50.00% |  |
Task: ca8de6ea
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 9 | 100.00% |  |
Task: d017b73f
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 0 | 0.00% |  |
Task: e133d23d
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 6 | 66.67% |  |
Task: e345f17b
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 14 | 87.50% |  |
| 1 | 10 | 62.50% |  |
Task: e633a9e5
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 17 | 68.00% |  |
Task: ed74f2f2
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 1 | 11.11% |  |
Task: ed98d772
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 24 | 66.67% |  |
Task: fc754716
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 37 | 58.73% |  |
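
With per-pair results listed for both submission_4o.json and submission_4omini.json, a paired comparison of the kind summarized in the Wilcoxon p-value column can be sketched. The report does not state which quantity was paired (per-pair Pixel %, per-task score, or something else) or which variant of the test was used, so the following is only an illustration: a pure-Python two-sided Wilcoxon signed-rank test using the normal approximation, applied to the Pixel % of the first six test pairs of each submission. It is not expected to reproduce the table's p-values, and the normal approximation is rough for so few pairs. The function name is hypothetical.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Zero differences are dropped, tied absolute differences share their
    average rank, and the variance is tie-corrected. Returns (W, p), where
    W is the smaller of the positive and negative rank sums.
    """
    # Round to two decimals because the report lists percentages that way.
    diffs = [round(a - b, 2) for a, b in zip(x, y)]
    diffs = [d for d in diffs if d != 0]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean_w = n * (n + 1) / 4
    var_w = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w - mean_w) / math.sqrt(var_w)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w, p


# Pixel % of the first six test pairs (tasks 00576224 through 31d5ba1a).
pct_4o = [100.00, 66.67, 72.22, 98.77, 46.67, 40.00]
pct_4omini = [100.00, 11.11, 77.78, 97.53, 53.33, 66.67]
w, p = wilcoxon_signed_rank(pct_4o, pct_4omini)
print(f"W = {w}, p = {p:.3f}")  # W is 6.0 for this subset
```

For the full comparison, an established implementation such as `scipy.stats.wilcoxon` would be the safer choice, since it can also compute exact p-values for small samples.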
Submission: submission_agentswtool.json
Task: 00576224
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 36 | 100.00% |  |
Task: 17cae0c1
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 9 | 33.33% |  |
Task: 2072aba6
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 36 | 100.00% |  |
Task: 27a77e38
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 80 | 98.77% |  |
Task: 31d5ba1a
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 9 | 60.00% |  |
| 1 | 7 | 46.67% |  |
Task: 34b99a2b
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 11 | 55.00% |  |
Task: 4cd1b7b2
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 16 | 100.00% |  |
Task: 59341089
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 28 | 77.78% |  |
Task: 62b74c02
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 28 | 50.00% |  |
Task: 66e6c45b
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 16 | 100.00% |  |
Task: 66f2d22f
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 17 | 60.71% |  |
Task: 68b67ca3
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 0 | 0.00% |  |
Task: 6ea4a07e
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 9 | 100.00% |  |
| 1 | 9 | 100.00% |  |
Task: 72207abc
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 81 | 93.10% |  |
Task: 8ba14f53
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 5 | 55.56% |  |
Task: a8610ef7
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 23 | 63.89% |  |
Task: aa18de87
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 36 | 100.00% |  |
Task: b1fc8b8e
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 25 | 100.00% |  |
| 1 | 21 | 84.00% |  |
Task: bbb1b8b6
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 0 | 0.00% |  |
| 1 | 11 | 68.75% |  |
Task: be03b35f
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 1 | 25.00% |  |
Task: ca8de6ea
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 1 | 11.11% |  |
Task: d017b73f
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 16 | 66.67% |  |
Task: e133d23d
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 5 | 55.56% |  |
Task: e345f17b
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 2 | 12.50% |  |
| 1 | 10 | 62.50% |  |
Task: e633a9e5
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 13 | 52.00% |  |
Task: ed74f2f2
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 1 | 11.11% |  |
Task: ed98d772
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 20 | 55.56% |  |
Task: fc754716
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 63 | 100.00% |  |
Submission: submission_finetune4o.json
Task: 00576224
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 36 | 100.00% |  |
Task: 17cae0c1
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 27 | 100.00% |  |
Task: 2072aba6
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 36 | 100.00% |  |
Task: 27a77e38
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 80 | 98.77% |  |
Task: 31d5ba1a
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 10 | 66.67% |  |
| 1 | 9 | 60.00% |  |
Task: 34b99a2b
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 12 | 60.00% |  |
Task: 4cd1b7b2
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 16 | 100.00% |  |
Task: 59341089
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 36 | 100.00% |  |
Task: 62b74c02
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 56 | 100.00% |  |
Task: 66e6c45b
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 16 | 100.00% |  |
Task: 66f2d22f
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 18 | 64.29% |  |
Task: 68b67ca3
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 9 | 100.00% |  |
Task: 6ea4a07e
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 9 | 100.00% |  |
| 1 | 9 | 100.00% |  |
Task: 72207abc
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 87 | 100.00% |  |
Task: 8ba14f53
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 7 | 77.78% |  |
Task: a8610ef7
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 27 | 75.00% |  |
Task: aa18de87
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 36 | 100.00% |  |
Task: b1fc8b8e
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 25 | 100.00% |  |
| 1 | 21 | 84.00% |  |
Task: bbb1b8b6
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 16 | 100.00% |  |
| 1 | 10 | 62.50% |  |
Task: be03b35f
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 2 | 50.00% |  |
Task: ca8de6ea
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 9 | 100.00% |  |
Task: d017b73f
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 17 | 70.83% |  |
Task: e133d23d
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 7 | 77.78% |  |
Task: e345f17b
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 14 | 87.50% |  |
| 1 | 10 | 62.50% |  |
Task: e633a9e5
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 25 | 100.00% |  |
Task: ed74f2f2
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 9 | 100.00% |  |
Task: ed98d772
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 25 | 69.44% |  |
Task: fc754716
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 63 | 100.00% |  |
Submission: submission_finetune4omini.json
Task: 00576224
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 36 | 100.00% |  |
Task: 17cae0c1
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 27 | 100.00% |  |
Task: 2072aba6
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 10 | 27.78% |  |
Task: 27a77e38
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 79 | 97.53% |  |
Task: 31d5ba1a
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 10 | 66.67% |  |
| 1 | 12 | 80.00% |  |
Task: 34b99a2b
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 11 | 55.00% |  |
Task: 4cd1b7b2
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 4 | 25.00% |  |
Task: 59341089
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 16 | 44.44% |  |
Task: 62b74c02
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 44 | 78.57% |  |
Task: 66e6c45b
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 16 | 100.00% |  |
Task: 66f2d22f
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 9 | 32.14% |  |
Task: 68b67ca3
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 9 | 100.00% |  |
Task: 6ea4a07e
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 2 | 22.22% |  |
| 1 | 4 | 44.44% |  |
Task: 72207abc
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 78 | 89.66% |  |
Task: 8ba14f53
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 6 | 66.67% |  |
Task: a8610ef7
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 24 | 66.67% |  |
Task: aa18de87
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 36 | 100.00% |  |
Task: b1fc8b8e
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 25 | 100.00% |  |
| 1 | 25 | 100.00% |  |
Task: bbb1b8b6
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 16 | 100.00% |  |
| 1 | 10 | 62.50% |  |
Task: be03b35f
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 4 | 100.00% |  |
Task: ca8de6ea
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 9 | 100.00% |  |
Task: d017b73f
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 0 | 0.00% |  |
Task: e133d23d
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 8 | 88.89% |  |
Task: e345f17b
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 14 | 87.50% |  |
| 1 | 11 | 68.75% |  |
Task: e633a9e5
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 25 | 100.00% |  |
Task: ed74f2f2
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 6 | 66.67% |  |
Task: ed98d772
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 21 | 58.33% |  |
Task: fc754716
| Pair Index | Correct Pixels | Pixel % | Visualization |
|---|
| 0 | 63 | 100.00% |  |